Name: Group9-Jobless Jesters
Course: BUDT704
Section: 0502
Date: December 6, 2023
In an era where job markets are rapidly evolving, understanding the dynamics of job postings on prominent platforms such as LinkedIn is crucial. Our project, “Job Seeker’s Data Disco”, focuses on this vital area, offering deep insights into the current landscape of job opportunities. We hope this will help college students gain an understanding of the professional journey ahead of them.
The remote work boom during the pandemic has transformed how industries operate, what they expect from their employees, and how they manage their resources. Our project examines how these changes are reflected in LinkedIn job postings, providing a window into the demand for skills tailored to remote and hybrid work environments.
Our analysis goes beyond counting job advertisements. It examines the nature and quality of engagement these postings receive, revealing where the real opportunities for job seekers lie among industries and companies and highlighting the most dynamic sectors.
Our project aims to uncover:
Trends in remote work: How the prevalence of remote work in job postings varies across different locations and industries.
Job market dynamics: Understanding the gap between job demand and supply.
Popularity and engagement: Understanding which job types are gaining most traction and engagement.
Skillset Analysis: Identifying the most in-demand skills, including the technical and soft skills crucial for today’s job market.
How has the prevalence of remote work in job postings changed across different locations, and are there differences between industries?
Which job type has the highest number of postings, and which jobs receive the most applications? Is there a gap between demand and supply, and why?
How does traction (views) compare to engagement (applications) for different jobs? What does that tell us about the popularity of different companies and job types?
Which skills are most wanted across all industries, and why?
| Task | Task Lead | Due Date | Status | Reasoning/Planning |
|---|---|---|---|---|
| Improvement over Project Proposal | Jeet | N/A | Completed | N/A |
| Performing Exploratory Data Analysis | Aarya | Oct 18 | Completed | N/A |
| Data Interpretation | Ning | Oct 23 | Completed | N/A |
| Data Integration | Prashant | Oct 25 | Completed | N/A |
| Filtering unwanted Rows and Columns | Hetvi | Oct 27 | Completed | N/A |
| Imputing required data | Prashant | Oct 28 | Completed | N/A |
| Managing NULL values | Aarya | Oct 28 | Completed | N/A |
| Data Reshaping | Ning | Nov 2 | Completed | N/A |
| Data Quality Check | Akshat | Nov 7 | Completed | N/A |
| Project Update | Hetvi | Nov 10 | Completed | N/A |
| Text Processing for Locations | Ning | Nov 12 | Completed | N/A |
| Text Processing for Skills data | Aarya | Nov 12 | Completed | N/A |
| Data Merging | Prashant | Nov 14 | Completed | N/A |
| Text Processing for Job Titles | Jeet | Nov 14 | Completed | N/A |
| Data Analysis - Question 1 | Prashant | Nov 20 | Completed | N/A |
| Data Analysis - Question 2 | Jeet | Nov 20 | Completed | N/A |
| Data Analysis - Question 3 | Aarya | Nov 20 | Completed | N/A |
| Data Analysis - Question 4 | Hetvi | Nov 20 | Completed | N/A |
| Inferences | Rohan | Nov 24 | Completed | N/A |
| Markdown | Akshat | Nov 28 | Completed | N/A |
| Project Report | Hetvi | Dec 4 | Completed | N/A |
Our project should be graded more heavily on Data Analysis for the following reasons:
We import the necessary Python libraries for the project. Matplotlib, Seaborn, and Plotly are required for visualization.
# Importing the required libraries
import pandas as pd
import numpy as np
import re
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
From the Kaggle site, we got a total of 8 csv files.
First we load the job_postings dataset, which has 27 columns and 15886 rows. It describes the job postings made on LinkedIn by various companies: the position offered, the description written by the company, the minimum and maximum salary for the role, the office location, how many people applied for and viewed the posting, whether remote work is allowed, the company URL where the posting was made, the recommended experience level, required skills or qualifications (if the company lists them), whether the posting is sponsored, the work type (full-time, part-time, or contract based), and the currency in which employees will be paid, among other columns.
# Loading the job_postings csv file using pd.read_csv
job_posting_df = pd.read_csv('job_postings.csv')
job_posting_df.head(2) # Printing the first two rows of the job_posting dataset
| job_id | company_id | title | description | max_salary | med_salary | min_salary | pay_period | formatted_work_type | location | ... | expiry | closed_time | formatted_experience_level | skills_desc | listed_time | posting_domain | sponsored | work_type | currency | compensation_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 85008768 | NaN | Licensed Insurance Agent | While many industries were hurt by the last fe... | 52000.0 | NaN | 45760.0 | YEARLY | Full-time | Chico, CA | ... | 1.710000e+12 | NaN | NaN | NaN | 1.690000e+12 | NaN | 1 | FULL_TIME | USD | BASE_SALARY |
| 1 | 133114754 | 77766802.0 | Sales Manager | Are you a dynamic and creative marketing profe... | NaN | NaN | NaN | NaN | Full-time | Santa Clarita, CA | ... | 1.700000e+12 | NaN | NaN | NaN | 1.690000e+12 | NaN | 0 | FULL_TIME | NaN | NaN |
2 rows × 27 columns
After performing EDA, we found that many columns in the job_postings file will not be useful for our analysis. Columns such as max_salary and min_salary can be eliminated, since we are focusing on the number of job postings with respect to location and industry.
This step removes several columns from the 'job_posting_df' dataset, streamlining it by dropping columns that are unnecessary for the analysis or unlikely to contribute to it.
# dropping the unnecessary columns present in the dataset
job_posting_df.drop(['description', 'max_salary','med_salary','min_salary','pay_period','original_listed_time','job_posting_url','application_url','application_type','expiry','closed_time','formatted_experience_level','skills_desc','listed_time','posting_domain','work_type','currency','compensation_type'], axis=1, inplace=True)
Now we load the 'employee_counts' dataset, which has 4 columns and 15907 rows. It gives details about each company, identified by a company id: how many employees the company has, how many followers it has on LinkedIn, and the time at which this was recorded.
# loading the employee_counts csv file using pd.read_csv
employee_counts_df = pd.read_csv('employee_counts.csv')
employee_counts had a lot of duplicated data, so we delete the duplicates to achieve consistency and ensure that each company's record is unique.
# dropping the duplicates present in the company_id column of the employee_counts dataset
employee_counts_df.drop_duplicates(subset='company_id',inplace=True)
Merging these datasets helps us understand the work environment by taking company size into account. This can give students valuable insights when they look for jobs at the companies they target.
# merging the employee counts dataset and the job postings dataset using pd.merge on the company id column
job_posting_merged_df = pd.merge(job_posting_df, employee_counts_df, on='company_id', how='left')
# dropping the unnecessary time_recorded column using .drop()
job_posting_merged_df.drop('time_recorded',axis=1,inplace=True)
We'll be checking for the null values in the merged data frame and take the necessary steps in dealing with them.
# printing the count of null values in each column
job_posting_merged_df.isnull().sum()
job_id 0 company_id 366 title 0 formatted_work_type 0 location 0 applies 7186 remote_allowed 13546 views 2763 sponsored 0 employee_count 366 follower_count 366 dtype: int64
As the output above shows, 'applies', 'remote_allowed', and 'views' have null values (the company-related columns also do; we address those later).
# filling the rows having NaN with 0 in the applies column using .fillna()
job_posting_merged_df['applies'].fillna(0,inplace=True)
# filling the rows having NaN with 0 in the remote_allowed column using .fillna()
job_posting_merged_df['remote_allowed'].fillna(0,inplace=True)
# filling the rows having NaN with 0 in the views column using .fillna()
job_posting_merged_df['views'].fillna(0,inplace=True)
# printing the count of null values in each column after replacing the NaNs with 0s in the applies, remote_allowed and views columns
job_posting_merged_df.isnull().sum()
job_id 0 company_id 366 title 0 formatted_work_type 0 location 0 applies 0 remote_allowed 0 views 0 sponsored 0 employee_count 366 follower_count 366 dtype: int64
# printing the updated dataset
job_posting_merged_df.head()
| job_id | company_id | title | formatted_work_type | location | applies | remote_allowed | views | sponsored | employee_count | follower_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 85008768 | NaN | Licensed Insurance Agent | Full-time | Chico, CA | 0.0 | 0.0 | 5.0 | 1 | NaN | NaN |
| 1 | 133114754 | 77766802.0 | Sales Manager | Full-time | Santa Clarita, CA | 0.0 | 0.0 | 0.0 | 0 | 15.0 | 159.0 |
| 2 | 133196985 | 1089558.0 | Model Risk Auditor | Contract | New York, NY | 1.0 | 0.0 | 17.0 | 0 | 48.0 | 14476.0 |
| 3 | 381055942 | 96654609.0 | Business Manager | Full-time | Forney, TX | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 |
| 4 | 529257371 | 1244539.0 | NY Studio Assistant | Full-time | New York, NY | 0.0 | 0.0 | 2.0 | 1 | 87.0 | 9790.0 |
At this point the merged data is fairly consistent: we have identified and handled missing values and redundancy in the dataset by filtering, imputing, and merging. We will nevertheless go through the data again later and refine it further.
Now we load the 'job_skills' dataset, which has 2 columns and 27899 rows. It lists the skills required for each job posting, i.e. the set of skills a person must have when applying to a company for that particular job. It contains the job_id from the job_postings dataset and a skills column.
# loading the job_skills dataset using .read_csv
skills = pd.read_csv('job_skills.csv')
We now reshape the dataframe by enumerating the skills required for each job_id and then pivoting the skill column to create a separate skill column for each slot.
# creating a new column skill and enumerating the skills
skills['skill']= skills.groupby("job_id")["skill_abr"].cumcount() + 1
# transform each value in the 'skill' column by prefixing it with 'Skill' and attaching a number to it
skills['skill']= skills['skill'].apply(lambda x: f'Skill {x}')
skills.head()
| job_id | skill_abr | skill | |
|---|---|---|---|
| 0 | 3690843087 | ACCT | Skill 1 |
| 1 | 3690843087 | FIN | Skill 2 |
| 2 | 3691763971 | MGMT | Skill 1 |
| 3 | 3691763971 | MNFC | Skill 2 |
| 4 | 3691775263 | MGMT | Skill 1 |
# creating a pivot table using .pivot to transform the skills column into 3 columns representing a different skill
skill_pivot =skills.pivot(index='job_id',columns='skill',values='skill_abr')
skill_pivot.head()
| skill | Skill 1 | Skill 2 | Skill 3 |
|---|---|---|---|
| job_id | |||
| 85008768 | SALE | BD | NaN |
| 133114754 | SALE | BD | NaN |
| 133196985 | ACCT | FIN | NaN |
| 529257371 | DSGN | ART | IT |
| 903408693 | ADM | NaN | NaN |
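The cumcount-and-pivot reshaping above can be verified on a toy long-format frame (hypothetical data, not from the Kaggle files):

```python
import pandas as pd

# Toy long-format skills table: job 1 has two skills, job 2 has one
toy = pd.DataFrame({'job_id': [1, 1, 2],
                    'skill_abr': ['ACCT', 'FIN', 'MGMT']})

# Enumerate skills within each job_id (1-based), then label the slots
toy['skill'] = toy.groupby('job_id')['skill_abr'].cumcount() + 1
toy['skill'] = toy['skill'].apply(lambda x: f'Skill {x}')

# Pivot to wide format: one row per job_id, one column per skill slot
wide = toy.pivot(index='job_id', columns='skill', values='skill_abr')
print(wide.loc[1, 'Skill 1'])  # ACCT
print(wide.loc[1, 'Skill 2'])  # FIN
```

Jobs with fewer skills simply get NaN in the unused skill columns, which is why 'Skill 2' and 'Skill 3' contain missing values downstream.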
Now we merge the two dataframes 'job_posting_merged_df' and 'skill_pivot'. Both share a common column, job_id, so we merge on that column using a left join. This means that all rows from job_posting_merged_df are included in the result, with the matching rows from skill_pivot joined in where they exist.
Merging the above mentioned dataframes will help us to understand what skills a company requires for a particular job while having all the details related to the job. This can help the students to get some valuable insights as to which skill they need to develop in order to apply for that job.
# merging the two dataframes using .merge and using left join
job = pd.merge(job_posting_merged_df, skill_pivot, on='job_id', how='left', suffixes=('','skill_'))
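The left-join semantics used here can be illustrated on a hypothetical pair of tiny frames:

```python
import pandas as pd

left = pd.DataFrame({'job_id': [1, 2], 'title': ['Analyst', 'Manager']})
right = pd.DataFrame({'job_id': [1], 'Skill 1': ['FIN']})

# Left join keeps every row of `left`; unmatched rows get NaN skill values
merged = pd.merge(left, right, on='job_id', how='left')
print(merged)
```

Job 2 has no entry in `right`, so its 'Skill 1' comes out as NaN, exactly as postings without listed skills do in the real merge.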
Now we load the 'companies' dataset, which has 10 columns and 994 rows. It describes every company that has posted a job on LinkedIn, with the details a person would want when applying for a job.
# loading the companies dataset using .read_csv
company = pd.read_csv('companies.csv')
# checking for duplicates in the dataset
company.duplicated().unique()
array([False])
Now we merge the two dataframes 'job' and 'company'. Both share a common column, company_id, so we merge on that column using a left join. This means that all rows from the job dataframe are included in the result, with the matching rows from the company dataset joined in where they exist.
Merging these dataframes enriches each job posting with details about the company behind it. This can give students valuable context about the companies they are considering applying to.
# merging "job" and "company" data on the "company_id" with a left join. In this way we will retain all entries from "Job"
df_job = pd.merge(job, company, on='company_id', how='left')
df_job.head()
| job_id | company_id | title | formatted_work_type | location | applies | remote_allowed | views | sponsored | employee_count | ... | Skill 3 | name | description | company_size | state | country | city | zip_code | address | url | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 85008768 | NaN | Licensed Insurance Agent | Full-time | Chico, CA | 0.0 | 0.0 | 5.0 | 1 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 133114754 | 77766802.0 | Sales Manager | Full-time | Santa Clarita, CA | 0.0 | 0.0 | 0.0 | 0 | 15.0 | ... | NaN | CargoLogin. | CargoLogin is a freight and logistics company ... | 1.0 | California | US | Santa Clarita | 28358 | Constellation Rd | https://www.linkedin.com/company/cargologinllc |
| 2 | 133196985 | 1089558.0 | Model Risk Auditor | Contract | New York, NY | 1.0 | 0.0 | 17.0 | 0 | 48.0 | ... | NaN | Employvision Inc. | Employvision is a talent acquisition and manag... | 1.0 | New Jersey | US | Somerset | 08873 | 285 Davidson Ave | https://www.linkedin.com/company/employvision |
| 3 | 381055942 | 96654609.0 | Business Manager | Full-time | Forney, TX | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | NaN | First Baptist Church Forney | NaN | 2.0 | Texas | US | Forney | 75126 | 1003 College St | https://www.linkedin.com/company/fbc-forney |
| 4 | 529257371 | 1244539.0 | NY Studio Assistant | Full-time | New York, NY | 0.0 | 0.0 | 2.0 | 1 | 87.0 | ... | IT | Ken Fulk Inc | Ken Fulk Inc, is a creative think tank based o... | 1.0 | California | US | San Francisco | 94103 | 310 Seventh Street | https://www.linkedin.com/company/ken-fulk-inc. |
5 rows × 23 columns
While going through the data we found that the location column is not consistent. We identified 5 cases:
1] Only the city is given
2] Only the country is given
3] The city and the state are given
4] The state and country are given
5] The city, state and the country are given
We decided to keep all datapoints in the location column in city-state format. To achieve this, we process the location data as follows.
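A minimal sketch of the normalization described above (the `split_city_state` helper and the `region_to_state` excerpt are hypothetical, for illustration only):

```python
# Hypothetical sketch: bare region names are remapped first, then the
# trailing two characters of a comma-bearing location are treated as the
# state abbreviation, mirroring the slicing used below.
region_to_state = {'San Francisco Bay Area': 'San Francisco Bay Area, CA'}  # excerpt

def split_city_state(loc):
    loc = region_to_state.get(loc, loc)
    if ',' not in loc:
        return None, None              # not in city-state form (cases 1 and 2)
    state = loc[-2:]                   # last two characters hold the abbreviation
    city = loc[:-2].replace(',', '').strip()
    return city, state

print(split_city_state('Chico, CA'))                 # ('Chico', 'CA')
print(split_city_state('San Francisco Bay Area'))    # ('San Francisco Bay Area', 'CA')
```

Locations without any comma fall through as unhandled, which is why the actual pipeline first renames the bare region names and then filters on the comma.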
# counting the unique values in "Location" Column
df_job['location'].value_counts()
United States 1133
New York, NY 398
Chicago, IL 267
Houston, TX 243
Atlanta, GA 207
...
Visalia, CA 1
Page, AZ 1
Blacksburg, VA 1
Windsor, WI 1
Vicksburg, MS 1
Name: location, Length: 3010, dtype: int64
To make the data as consistent as possible, certain locations need to be renamed. We create a dictionary mapping each location that needs changing to its new name.
# creating a new dictionary where certain locations (key) are given their new name (value)
new_location = {'United States': 'United States, US', 'New York City Metropolitan Area': 'New York City Metropolitan Area,NY', 'San Francisco Bay Area': 'San Francisco Bay Area, CA',
'Los Angeles Metropolitan Area' : 'Los Angeles Metropolitan Area, CA', 'Washington DC-Baltimore Area': 'Washington DC-Baltimore Area, DC',
'Dallas-Fort Worth Metroplex': 'Dallas-Fort Worth Metroplex, TX', 'Atlanta Metropolitan Area': 'Atlanta Metropolitan Area,GA',
'Greater Chicago Area': 'Greater Chicago Area, IL', 'Greater Houston': 'Greater Houston, TX', 'San Diego Metropolitan Area':'San Diego Metropolitan Area, CA',
'Greater Minneapolis-St. Paul Area': 'Greater Minneapolis-St. Paul Area, MN'}
# using .replace to rename the selected locations in the new_location dictionary
df_job['location'].replace(new_location,inplace=True)
# Filtering to keep only rows where "location" contains "," and creating a separate copy
jobs= df_job[df_job['location'].str.contains(',')==True].copy()
# Slice the State Abbreviations from the "Location" column
jobs['State'] = jobs['location'].map(lambda s: s[-2:])
jobs['City'] = jobs['location'].map(lambda c: c[:-2]).str.replace(',','')
jobs.columns
Index(['job_id', 'company_id', 'title', 'formatted_work_type', 'location',
'applies', 'remote_allowed', 'views', 'sponsored', 'employee_count',
'follower_count', 'Skill 1', 'Skill 2', 'Skill 3', 'name',
'description', 'company_size', 'state', 'country', 'city', 'zip_code',
'address', 'url', 'State', 'City'],
dtype='object')
# Check the values in "State" Column
jobs['State'].value_counts()
CA 1974 TX 1360 US 1133 NY 919 FL 757 es 717 IL 590 GA 482 NC 449 PA 441 MA 438 VA 436 WA 408 OH 397 NJ 386 MI 329 AZ 311 MD 310 MN 306 CO 287 TN 259 SC 233 WI 230 MO 210 IN 186 DC 159 CT 145 UT 141 OR 127 NV 124 ea 120 OK 111 IA 110 KS 98 LA 97 AL 95 KY 92 ID 68 AR 66 DE 56 NE 54 NM 49 MS 45 RI 40 NH 35 WV 34 HI 33 AK 31 ND 29 MT 20 SD 20 WY 18 ME 15 VT 13 Name: State, dtype: int64
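The stray 'ea' and 'es' values are an artifact of slicing the last two characters of location strings that are not in plain city-state form, as these hypothetical examples show:

```python
# The last-two-character slice extracts the abbreviation from 'City, ST'
# strings, but leaks trailing letters from longer region names
locations = ['Chico, CA',
             'Portland, Oregon Metropolitan Area',   # hypothetical example
             'Dallas, Texas, United States']         # hypothetical example
suffixes = [loc[-2:] for loc in locations]
print(suffixes)  # ['CA', 'ea', 'es']
```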
After further checking the values in the "State" and "City" columns, we discovered that the "State" column has some values that are not states, such as 'ea' and 'es', while the state information can be found in the corresponding "City" value. So we define a function called "findstate" that looks for state information in the "City" column: if a value matches one of the 50 states of the United States, we return that state; otherwise we return 'NA'. We can then update the "State" column and convert the names to state abbreviations for further analysis.
# creating a dictionary "state_abb" mapping US states to their respective abbrivations
state_abb = {'Alaska': 'AK', 'Alabama': 'AL', 'Arkansas': 'AR', 'Arizona': 'AZ', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT',
'District of Columbia': 'DC', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Iowa': 'IA', 'Idaho': 'ID',
'Illinois': 'IL', 'Indiana': 'IN', 'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA', 'Massachusetts': 'MA', 'Maryland': 'MD',
'Maine': 'ME', 'Michigan': 'MI', 'Minnesota': 'MN', 'Missouri': 'MO', 'Mississippi': 'MS', 'Montana': 'MT', 'North Carolina': 'NC',
'North Dakota': 'ND', 'Nebraska': 'NE', 'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM', 'Nevada': 'NV',
'New York': 'NY', 'Ohio': 'OH', 'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Virginia': 'VA', 'Vermont': 'VT', 'Washington': 'WA',
'Wisconsin': 'WI', 'West Virginia': 'WV', 'Wyoming': 'WY'}
# creating a dictionary "state" which has all us states and territories
state = ['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California', 'Colorado', 'Connecticut', 'District of Columbia', 'Delaware',
'Florida', 'Georgia', 'Hawaii', 'Iowa', 'Idaho', 'Illinois', 'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts',
'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri', 'Mississippi', 'Montana', 'North Carolina', 'North Dakota', 'Nebraska',
'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada', 'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Virginia', 'Vermont', 'Washington', 'Wisconsin', 'West Virginia',
'Wyoming']
# 'ea' and 'es' are filtered subsets of the jobs dataframe, where State is 'ea' or 'es' respectively
ea =jobs[jobs['State']=='ea'].replace('ea','')
es = jobs[jobs['State']=='es'].replace('es','')
# creating a dataframe by concatenating the "ea" and "es" dataframes
df_state = pd.concat([ea, es], axis=0)
# Creating a function that loops through each value in the "City" column and
# checks whether it contains one of the 50 US state names. Substring matching,
# longest names first, handles multi-word states such as "New York" or
# "West Virginia", which a word-by-word comparison would miss.
# Appends the found state name, or "NA" if no state name is detected.
def findstate(s, state):
    result = []
    candidates = sorted(state, key=len, reverse=True)  # longest names first
    for value in s:
        found = 'NA'
        for st in candidates:
            if st in value:
                found = st
                break
        result.append(found)
    return result
# Assigning state names to the "State" column in "df_state" using the findstate function on the "City" column
df_state['State'] =findstate(df_state['City'],state)
# replacing state names with their respective abbreviations using our "state_abb" dictionary.
df_state['State'].replace(state_abb, inplace=True)
# creating a new column job_state and copying the values from State
df_state['job_state']= df_state['State']
# merging the jobs dataframe with "df_state" on "job_id" using a left join, resulting in a new dataframe
merge_jobs = pd.merge(jobs, df_state[['job_id','job_state']], on='job_id',how='left')
# updating "state" in "merge_job" to "job_state" where "job_state" is not missing, otherwise keeping the original
merge_jobs['State'] = np.where(merge_jobs['job_state'].notna(), merge_jobs['job_state'], merge_jobs['State'])
# Dropping the column after update
merge_jobs.drop('job_state',axis=1,inplace=True)
# Check the values in "State" Column
merge_jobs['State'].value_counts().head()
CA 2078 TX 1451 US 1133 NY 919 FL 798 Name: State, dtype: int64
To do our analysis, we need to merge another dataset, "company_industries", into our dataframe to get the industry of each company.
# reading data from "company_industries.csv" and dropping duplicates
company_industry = pd.read_csv('company_industries.csv').drop_duplicates()
# Merging the dataframes "merge_jobs" and "company_industry" on "company_id" using a left join
df_jobs= pd.merge(merge_jobs, company_industry, on= 'company_id', how='left')
We select the columns needed for the final analysis dataframe ('title', 'City', 'State', 'formatted_work_type', 'remote_allowed', 'applies', 'views', 'sponsored', 'Skill 1', 'Skill 2', 'Skill 3', 'name', 'employee_count', 'follower_count', 'industry', 'city', 'state'), then rename them to clearer, more descriptive names.
# Creating a job_postings dataframe by selecting specific columns from df_jobs and making a separate copy.
job_postings = (df_jobs[['title','City','State','formatted_work_type','remote_allowed', 'applies','views', 'sponsored','Skill 1', 'Skill 2', 'Skill 3','name','employee_count','follower_count','industry','city','state']]).copy()
# Defining a dictionary "new_columns" to map existing column names to new, more descriptive names of columns
new_columns = {'title': 'Job_Title','City':'Job_City','State':'Job_State','formatted_work_type':'Job_Type', 'applies':'Applies', 'remote_allowed': 'Remote', 'views': 'Views', 'sponsored': 'Sponsored',
'name':'Company_Name', 'employee_count':'Employees', 'follower_count': 'Followers', 'industry': 'Company_Industry', 'city':'Company_City', 'state':'Company_State'}
# Renaming the names of the columns to new columns
job_postings.rename(columns=new_columns,inplace=True)
At last, we finalize our dataframe with some additional processing: filling in missing values and changing data types.
We assume each company that posted jobs has at least one employee, so we fill missing values in the "Employees" column with 1. We assume each company with a missing "Followers" value has 0 followers on LinkedIn, so we fill those with 0.
We change the data type of the "Remote", "Applies", "Views", "Employees" and "Followers" columns to integer to make the dataframe cleaner and more consistent.
# Counting the nulls in the dataframe
job_postings.isnull().sum()
Job_Title 0 Job_City 0 Job_State 0 Job_Type 0 Remote 0 Applies 0 Views 0 Sponsored 0 Skill 1 618 Skill 2 6421 Skill 3 13298 Company_Name 399 Employees 353 Followers 353 Company_Industry 380 Company_City 399 Company_State 402 dtype: int64
# Filling the missing values in the "Employees" column with 1, updating the dataframe in place
job_postings.Employees.fillna(1,inplace=True)
job_postings.Employees = job_postings.Employees.replace(0,1)
# Filling missing values in the "Followers" column of "job_postings" with 0, applying the changes directly to the dataframe.
job_postings.Followers.fillna(0,inplace=True)
# Converting the datatypes of the following columns to integers.
job_postings[['Remote','Applies','Views','Employees','Followers']] =job_postings[['Remote','Applies','Views','Employees','Followers']].astype(int)
After processing the data, the final dataset is ready for analysis. The final dataset is a combination of 5 datasets consisting of only the required and necessary columns. The dataset has 15593 rows and 17 columns. The dataset gives details on the job postings, the skills required, whether the job offers remote work, and the details about the companies.
# Printing the dataframe
job_postings.head()
| Job_Title | Job_City | Job_State | Job_Type | Remote | Applies | Views | Sponsored | Skill 1 | Skill 2 | Skill 3 | Company_Name | Employees | Followers | Company_Industry | Company_City | Company_State | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Licensed Insurance Agent | Chico | CA | Full-time | 0 | 0 | 5 | 1 | SALE | BD | NaN | NaN | 1 | 0 | NaN | NaN | NaN |
| 1 | Sales Manager | Santa Clarita | CA | Full-time | 0 | 0 | 0 | 0 | SALE | BD | NaN | CargoLogin. | 15 | 159 | Transportation/Trucking/Railroad | Santa Clarita | California |
| 2 | Model Risk Auditor | New York | NY | Contract | 0 | 1 | 17 | 0 | ACCT | FIN | NaN | Employvision Inc. | 48 | 14476 | Staffing & Recruiting | Somerset | New Jersey |
| 3 | Business Manager | Forney | TX | Full-time | 0 | 0 | 0 | 0 | NaN | NaN | NaN | First Baptist Church Forney | 1 | 0 | Religious Institutions | Forney | Texas |
| 4 | NY Studio Assistant | New York | NY | Full-time | 0 | 0 | 2 | 1 | DSGN | ART | IT | Ken Fulk Inc | 87 | 9790 | Design | San Francisco | California |
With the processing finished, the data is now consistent. We have identified and handled the missing values and redundancy in our dataset by filtering, imputing, and merging the datasets. A final check below confirms that the remaining missing values are confined to the optional skill and company columns.
# Check the missing value of the dataframe
job_postings.isnull().sum()
Job_Title 0 Job_City 0 Job_State 0 Job_Type 0 Remote 0 Applies 0 Views 0 Sponsored 0 Skill 1 618 Skill 2 6421 Skill 3 13298 Company_Name 399 Employees 0 Followers 0 Company_Industry 380 Company_City 399 Company_State 402 dtype: int64
# Find out the Job Type distribution
job_postings['Job_Type'].value_counts().to_frame()
| Job_Type | |
|---|---|
| Full-time | 12593 |
| Contract | 1722 |
| Part-time | 988 |
| Temporary | 119 |
| Internship | 111 |
| Other | 52 |
| Volunteer | 8 |
# Find out the Sponsorship distribution
job_postings['Sponsored'].value_counts().to_frame()
| Sponsored | |
|---|---|
| 0 | 11086 |
| 1 | 4507 |
# Find out the Top 5 industries which have the most job postings
job_postings['Company_Industry'].value_counts().to_frame().head()
| Company_Industry | |
|---|---|
| Staffing & Recruiting | 2184 |
| Information Technology & Services | 1754 |
| Hospital & Health Care | 1367 |
| Retail | 954 |
| Computer Software | 855 |
How has the prevalence of remote work in job postings changed at different locations and are there differences between the industries?
Addressing how the prevalence of remote work in job postings has changed across different locations and industries is crucial for job seekers in today’s evolving job market, for the following reasons:
It helps job seekers understand the dynamics of the remote work market and align their strategies with the industries and locations where remote work is prevalent.
It provides insight into the shifting nature of the work environment, enabling job seekers to make informed decisions about relocating or pursuing a career in a different sector.
Studies have shown that remote work has a strongly positive impact on work-life balance. Analysing these trends is vital for job seekers in a market increasingly influenced by remote work possibilities.
In our analysis, we aim to quantify the rise of remote work opportunities, revealing how this trend intersects with industry sectors and geographic regions. By calculating and visualising the percentage of remote work offerings, we provide job seekers with actionable data to navigate the job market effectively.
# Calculating the percentage of Job postings that allow remote work
remote_jobs_percentage = job_postings['Remote'].mean() * 100
# Displaying the percentage
print(f'The percentage of job postings that allow remote work is {remote_jobs_percentage:.2f}%')
The percentage of job postings that allow remote work is 14.65%
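Since 'Remote' is a 0/1 indicator, its mean is exactly the fraction of postings that allow remote work; a toy check with hypothetical values:

```python
import pandas as pd

# Toy 0/1 remote indicator: 2 of 8 postings allow remote work
remote = pd.Series([1, 0, 0, 1, 0, 0, 0, 0])
pct = remote.mean() * 100
print(f'{pct:.2f}%')  # 25.00%
```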
# Calculating the percentage of jobs that allow remote work with respect to each industry
remote_jobs_industries = job_postings.groupby('Company_Industry')['Remote'].mean().sort_values(ascending = False) * 100
# Displaying the data
top_10 = 10
remote_jobs_industries.head(top_10)
Company_Industry Fine Art 100.000000 Think Tanks 100.000000 Professional Training & Coaching 85.714286 Writing & Editing 75.000000 E-learning 71.428571 Market Research 66.666667 Political Organization 66.666667 Venture Capital & Private Equity 65.000000 Investment Management 50.000000 Computer Hardware 40.000000 Name: Remote, dtype: float64
In this question, the prevalence of remote work in different states and across various industries is discussed. To display it visually, we use a bar chart to show the prevalence of remote work across the top industries, and a choropleth map of the United States to show the prevalence of remote work in each state.
Why take a Bar Chart?
A bar chart makes it easy to compare a single metric, here the percentage of remote postings, across many industries at a glance.
Why take a Choropleth Map?
To visualize the prevalence of remote work in different states, a choropleth map is the best choice, since it shows how a value varies across geographic regions.
# Importing plotly
import plotly.express as px
# Resetting the Index for top 30 records
top_30_remote_jobs_industries = remote_jobs_industries.reset_index().head(30)
# Plotting the graph to display the percentage of remote work availability
bar_chart = px.bar(top_30_remote_jobs_industries,x = 'Company_Industry', y='Remote', title='Remote Work Availability by Industry')
bar_chart.update_layout(width = 1000, height = 600, xaxis_title='Industry', yaxis_title='Percentage Remote Work Availability')
bar_chart.update_xaxes(tickangle=45)
bar_chart.show()
# Calculating the percentage of jobs that allow remote work for each state
remote_work_by_state = (job_postings.groupby('Job_State')['Remote'].mean() * 100).reset_index(name = 'Proportion')
# Creating a choropleth map
remote_work_by_state_map = px.choropleth(
remote_work_by_state,
locations='Job_State',
locationmode='USA-states',
color='Proportion',
color_continuous_scale="blues",
title='Percentage of Job Postings with Remote Work by State',
labels={'Proportion': 'Percentage of Job Postings by State'},
range_color=(0, 40)
)
# Update the layout to include a US map
remote_work_by_state_map.update_geos(scope='usa')
remote_work_by_state_map.update_layout(width=1000, height=600)
# Displaying the map
remote_work_by_state_map.show()
Wyoming, Maine and Missouri are the leading states in terms of remote work proportion. These are our three inferences from that observation:
1. Quality of life as a remote work driver:
2. Impact of population and geography as a remote work driver:
3. Economic Diversity as remote work driver:
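The leading states can be pulled directly from the aggregated per-state frame. A minimal sketch of the pattern, using a small hypothetical frame in place of our `remote_work_by_state` data (the values below are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for remote_work_by_state (state -> % of remote-friendly postings)
remote_work_by_state = pd.DataFrame({
    'Job_State': ['WY', 'ME', 'MO', 'CA', 'NY'],
    'Proportion': [38.0, 35.5, 33.0, 18.2, 16.9],
})

# Top 3 states by share of remote-friendly postings
top_states = (remote_work_by_state
              .sort_values('Proportion', ascending=False)
              .head(3))
print(top_states['Job_State'].tolist())
```

The same `sort_values(...).head(3)` call applied to the real frame yields the three states discussed above.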
Which job type has the highest number of postings, and which jobs are being applied to the most? Is there a gap between demand and supply, and why is that?
This question identifies the job type with the highest number of postings and the one receiving the most applications. This helps with the following:
1] Identification of in-demand jobs: - It identifies the jobs that are preferred by a large number of applicants.
2] Helps HR and companies build strategies: - It gives HR teams and companies insight into which job types are in high demand but low in supply, or vice versa, helping them understand these dynamics for effective workforce planning.
3] Guidance for job seekers and career changers: - It points individuals who are looking for jobs, or attempting to change career paths, toward the skills needed to be competitive for an in-demand job.
Our analysis targets understanding labour market dynamics by examining job postings and application volumes across various industries. By aggregating and comparing the number of job postings (demand) and applications received (supply), we aim to identify which industries are experiencing high demand for labour and which are attracting the most applicants. We visualise the top 10 industries by their ratio of applications to job postings. This analysis aims to offer valuable insights for job seekers and employers alike, giving them more information to support their decision making.
# Group by Company_Industry and calculate the sum of Applies and count of Job_Title for each Industry
grouped_df = job_postings.groupby('Company_Industry').agg({'Applies': 'sum', 'Job_Title': 'count'}).reset_index()
# Rename the columns
grouped_df = grouped_df.rename(columns={'Applies': 'total_applies', 'Job_Title': 'postings_count'})
# Display the grouped dataframe
grouped_df.head()
| | Company_Industry | total_applies | postings_count |
|---|---|---|---|
| 0 | Accounting | 630 | 115 |
| 1 | Airlines/Aviation | 107 | 14 |
| 2 | Alternative Medicine | 1 | 2 |
| 3 | Animation | 1 | 1 |
| 4 | Apparel & Fashion | 1819 | 116 |
# Sort the dataframe by postings_count in descending order
sorted_postings = grouped_df.sort_values(by='postings_count', ascending=False)
# Display the job type with the highest number of postings
highest_job_postings = sorted_postings.iloc[0]['Company_Industry']
# Display the count of the highest postings
highest_postings_count = sorted_postings.iloc[0]['postings_count']
print(f'The job type with the highest number of postings is: {highest_job_postings} with {highest_postings_count} postings ')
The job type with the highest number of postings is: Staffing & Recruiting with 2184 postings
Since the highest number of job postings is in Staffing and Recruiting, we can infer that this job type is an essential function across many industries. As businesses grow and experience turnover, there is a constant need to fill these positions. This constant demand is why Staffing and Recruiting has a high volume of job postings.
# Sort the dataframe by total_applies in descending order
sorted_applies = grouped_df.sort_values(by='total_applies', ascending=False)
# Display the job type with the highest number of applicants
highest_job_applies = sorted_applies.iloc[0]['Company_Industry']
# Display the count of the highest applicants
highest_total_applies_count = sorted_applies.iloc[0]['total_applies']
print(f'The job type with the highest number of applicants is: {highest_job_applies} with {highest_total_applies_count} applicants ')
The job type with the highest number of applicants is: Staffing & Recruiting with 42912 applicants
Since the highest number of applicants is also for Staffing and Recruiting, we can infer that there is a constant supply of people who want to fill these roles. This is probably because not much experience is needed to apply for these jobs: a person of any age, at any point in their life, can apply for and work in most staffing and recruiting positions.
# Create a new column and store the ratio of total_applies to postings_count
grouped_df['ratio_applies_postings'] = grouped_df['total_applies'] / grouped_df['postings_count']
# Increment index values to start the index from 1
grouped_df.index = grouped_df.index + 1
# Displaying the data
grouped_df.head()
| | Company_Industry | total_applies | postings_count | ratio_applies_postings |
|---|---|---|---|---|
| 1 | Accounting | 630 | 115 | 5.478261 |
| 2 | Airlines/Aviation | 107 | 14 | 7.642857 |
| 3 | Alternative Medicine | 1 | 2 | 0.500000 |
| 4 | Animation | 1 | 1 | 1.000000 |
| 5 | Apparel & Fashion | 1819 | 116 | 15.681034 |
# Sort grouped_df to show the ratio in descending order
df_sorted_ratio = grouped_df.sort_values(by='ratio_applies_postings', ascending=False)
# Display the top 10 industries with the highest ratios
df_sorted_ratio.head(top_10)
| | Company_Industry | total_applies | postings_count | ratio_applies_postings |
|---|---|---|---|---|
| 67 | Investment Management | 158 | 2 | 79.000000 |
| 140 | Writing & Editing | 286 | 4 | 71.500000 |
| 129 | Think Tanks | 109 | 2 | 54.500000 |
| 52 | Health, Wellness & Fitness | 4734 | 91 | 52.021978 |
| 28 | Cosmetics | 1428 | 28 | 51.000000 |
| 21 | Computer Hardware | 231 | 5 | 46.200000 |
| 134 | Venture Capital & Private Equity | 809 | 20 | 40.450000 |
| 72 | Legislative Office | 31 | 1 | 31.000000 |
| 117 | Research | 1480 | 57 | 25.964912 |
| 16 | Civic & Social Organization | 308 | 12 | 25.666667 |
We are using a line graph because it highlights variation across an ordered set of categories, here the top 10 industries ranked by their applies-to-postings ratio.
Supply: In this case, supply is the number of applicants, captured in the total_applies column of the dataframe.
Demand: Demand is the number of postings, captured in the postings_count column. (From an industry's point of view the labels flip: its postings express its demand for labour, and the applicants are the labour supply.)
When the ratio of applies to postings is greater than 1 for an industry, the supply of applicants exceeds that industry's demand.
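As a sanity check on this reading, one can count how many industries have more applicants than postings. A sketch on a hypothetical slice standing in for `grouped_df` (the four rows mirror the sample table above):

```python
import pandas as pd

# Hypothetical slice of grouped_df: industry, total applications, posting count
grouped_df = pd.DataFrame({
    'Company_Industry': ['Accounting', 'Animation',
                         'Apparel & Fashion', 'Alternative Medicine'],
    'total_applies': [630, 1, 1819, 1],
    'postings_count': [115, 1, 116, 2],
})

# Ratio of applicant supply to posting demand
grouped_df['ratio_applies_postings'] = (grouped_df['total_applies']
                                        / grouped_df['postings_count'])

# Industries where applicant supply exceeds posting demand (ratio > 1)
oversupplied = grouped_df[grouped_df['ratio_applies_postings'] > 1]
print(f"{len(oversupplied)} of {len(grouped_df)} industries have more applies than postings")
```

Applied to the full `grouped_df`, the same filter separates applicant-rich industries from those with a potential skills gap.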
# Select the top 10 values for the line graph
top_10 = df_sorted_ratio.head(10).copy()
# Map emojis to the respective industry's ratio
emojis = ['🏦', '📝', '💡', '🏋️♀️', '💄', '💻', '💸', '🧑⚖️', '🏛️', '🔬']
# Add emojis to the 'text' column
top_10.loc[:, 'text'] = [f"{emoji} {percentage:.2f}" for emoji, percentage in zip(emojis, top_10['ratio_applies_postings'])]
# Create the line graph
fig = px.line(top_10,
x='Company_Industry',
y='ratio_applies_postings',
text='text',
labels={'ratio_applies_postings': 'Ratio (Applies to Posting)'},
title='Top 10 Industries by Highest Ratio',
height=700)
# Edit layout
fig.update_layout(
xaxis_title='Company Industry',
yaxis_title='Ratio (Applicants/Job Postings)',
plot_bgcolor='black',
paper_bgcolor='black',
font_color='white',
margin=dict(l=10, r=1, t=50, b=50),
)
fig.update_traces(
textfont=dict(size=17),
line_shape='linear',
line=dict(width=2, color='aqua'),
textposition='top right'
)
# Show the interactive plot
fig.show()
The analysis uncovers a lot of interesting insights about demand and supply.
There are two major inferences that we would like to discuss:
High competition in specific industries: The graph suggests that industries like investment management and writing and editing are experiencing a high ratio of applications to job postings, which could indicate a highly competitive job market in these fields. This may be due to a surplus of qualified candidates relative to the number of available positions, or it could reflect the strong desirability of jobs within these industries, leading to many applications for relatively few openings. Many people also take up writing and editing as a side hustle, and both investment management and writing and editing can be done completely remotely, which motivates more people to apply.
Potential skill gap: On the other hand, industries such as civic and social organization and research have a lower ratio of applications to job postings. This could indicate a skills gap, where the number of available jobs exceeds the number of applicants who possess the required qualifications, or it might suggest less interest in these sectors among job seekers. For HR and strategic planning, these industries might need to invest more in training programs to build the required workforce, or in recruitment marketing to attract more applicants.
Why do we want to explore this perspective of the job postings? As graduate students, we will soon be job seekers. The number of times a job posting is viewed by potential candidates can be considered traction, while the number of applications submitted for a particular job reflects engagement. Higher traction and engagement could suggest that the job is attractive to a larger pool of candidates, possibly because of factors like company reputation, salary, or career growth opportunities. Since we do not have data on salary or career growth opportunities here, we approach the question through company reputation. Larger companies tend to be better known and more influential in their industries; some are the dominant players in their markets, which gives them higher reputations than medium or small companies. So we expect job postings from large companies to attract more views and applies due to their brand recognition and resources for extensive job advertising. Medium and small companies might see less traction and engagement, especially if they are newly established or operate in a niche market.
To look into this problem, we first checked the range of total employee counts for the companies posting jobs and discovered that it runs from 1 to 829,111. We therefore divided companies into 8 size bins.
We then grouped these bins into four categories for further analysis: small, medium, large and giant companies. Since giant companies tend to influence, or sometimes dominate, their industries, a career with them seems more promising, so we expected their job postings to draw the most views and applies.
| Company Category | Company Size | Employee Amount | Description | Reputation Scale |
|---|---|---|---|---|
| Small Company | 1, 2 | 0-1,000 | Newly established or operate in a niche market | Little to no |
| Medium Company | 3, 4 | 1,000-10,000 | N/A | Medium to little |
| Large Company | 5, 6, 7 | 10,000-500,000 | N/A | Tend to be high |
| Giant Company | 8 | 500,000-1,000,000 | Influential companies in their industries | Highest with global presence |
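The eight numeric size bins can be collapsed into the four categories in the table above with a simple mapping. A minimal sketch (the mapping follows the table; the sample size values are illustrative):

```python
import pandas as pd

# Map the 8 Company_Size bins onto the 4 categories from the table above
size_to_category = {1: 'Small', 2: 'Small',
                    3: 'Medium', 4: 'Medium',
                    5: 'Large', 6: 'Large', 7: 'Large',
                    8: 'Giant'}

# Illustrative sample of Company_Size values
sizes = pd.Series([1, 3, 5, 8, 2, 7])
categories = sizes.map(size_to_category)
print(categories.tolist())
```

The same `.map(size_to_category)` applied to `view_apply['Company_Size']` would add the category label used in the discussion that follows.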
# Select the columns needed for this analysis and created a new dataframe
view_apply = job_postings[['Job_Title','Job_Type','Job_State','Remote','Sponsored','Company_Name','Employees','Company_Industry','Views','Applies']].copy()
view_apply.head()
| | Job_Title | Job_Type | Job_State | Remote | Sponsored | Company_Name | Employees | Company_Industry | Views | Applies |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Licensed Insurance Agent | Full-time | CA | 0 | 1 | NaN | 1 | NaN | 5 | 0 |
| 1 | Sales Manager | Full-time | CA | 0 | 0 | CargoLogin. | 15 | Transportation/Trucking/Railroad | 0 | 0 |
| 2 | Model Risk Auditor | Contract | NY | 0 | 0 | Employvision Inc. | 48 | Staffing & Recruiting | 17 | 1 |
| 3 | Business Manager | Full-time | TX | 0 | 0 | First Baptist Church Forney | 1 | Religious Institutions | 0 | 0 |
| 4 | NY Studio Assistant | Full-time | NY | 0 | 1 | Ken Fulk Inc | 87 | Design | 2 | 0 |
# Drop duplicates in the "Employee" Column to find the range of the values
view_apply['Employees'].drop_duplicates().describe()
count      2501.000000
mean      11034.045182
std       36625.131630
min           1.000000
25%         716.000000
50%        2417.000000
75%        8069.000000
max      829111.000000
Name: Employees, dtype: float64
# Create bins and bin labels for the "Employee" Column to categorize different sizes of the companies
bins = [0, 200, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
bin_labels = [1, 2, 3, 4, 5, 6, 7, 8]
# Create a new column as "Company_Size" for further analysis
view_apply['Company_Size'] = pd.cut(view_apply['Employees'], bins, labels=bin_labels)
view_apply.head()
| | Job_Title | Job_Type | Job_State | Remote | Sponsored | Company_Name | Employees | Company_Industry | Views | Applies | Company_Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Licensed Insurance Agent | Full-time | CA | 0 | 1 | NaN | 1 | NaN | 5 | 0 | 1 |
| 1 | Sales Manager | Full-time | CA | 0 | 0 | CargoLogin. | 15 | Transportation/Trucking/Railroad | 0 | 0 | 1 |
| 2 | Model Risk Auditor | Contract | NY | 0 | 0 | Employvision Inc. | 48 | Staffing & Recruiting | 17 | 1 | 1 |
| 3 | Business Manager | Full-time | TX | 0 | 0 | First Baptist Church Forney | 1 | Religious Institutions | 0 | 0 | 1 |
| 4 | NY Studio Assistant | Full-time | NY | 0 | 1 | Ken Fulk Inc | 87 | Design | 2 | 0 | 1 |
# Check the company numbers of each company size
view_apply.Company_Size.value_counts()
1    4288
3    3192
5    2823
2    2753
4    1360
6     692
7     381
8     104
Name: Company_Size, dtype: int64
# Create the visualization for the views and applies in job postings of different company sizes
import matplotlib.pyplot as plt
import numpy as np
# Scale marker size by company size so larger companies draw bigger bubbles
size = view_apply.Company_Size.astype(int) ** 3.5
# creating a scatter plot with views and applies on X and Y axies respectively
view_apply.plot.scatter(x='Views',y='Applies',figsize= (12,6),s=size,c='Company_Size',colormap='viridis')
# setting the limits for better visualisation
plt.axis(ymin=0, ymax=1200, xmin=1, xmax=3000)
# performing a linear regression
m, b = np.polyfit(view_apply.Views, view_apply.Applies, 1)
# plotting x and y (y = mx + b )
plt.plot(view_apply.Views, m*view_apply.Views + b)
From the graph above we can discover several findings:
The third observation did not match our expectation; quite the contrary, it was the opposite. Based on this analysis, we discussed and inferred that some drawbacks of large/giant companies, and some benefits of working at small companies, may contribute to this situation.
As is well known, job postings from larger companies often have more extensive requirements. They might seek candidates with specific educational backgrounds, years of experience, and a diverse set of skills. This higher standard may intimidate job seekers and discourage them from applying without even opening the job details. Moreover, candidates might assume that jobs at larger companies are highly competitive and that their chances of success are slim, resulting in self-selection out of the process due to time and effort concerns.
Larger organizations often have more structured and formalized processes, including strict policies and procedures. While this can provide stability and consistency, it might also mean less flexibility in certain aspects of work, decision-making, or adapting to individual needs.
Small companies often have the ability to focus on niche markets or specialized areas within an industry. This niche focus allows them to tailor their offerings more precisely and attract candidates with specific skills or interests. So it will attract people who are interested in the market, especially if it is emerging.
Small companies may offer more unique opportunities because they have fewer hierarchical levels. This structure gives employees more chances to be involved in decision-making, and thus more chances to show their skills and talent.
In small companies, employees may have more chances to engage in various tasks and projects, contributing to a broader skill set and a more dynamic work experience. This variety can be attractive to individuals who value diversity in their work.
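One way to put numbers behind these observations is to compare average traction and engagement per posting across the size bins. A sketch on hypothetical data standing in for `view_apply` (the values are illustrative, chosen only to show the groupby pattern):

```python
import pandas as pd

# Hypothetical stand-in for view_apply with Views, Applies and Company_Size
view_apply = pd.DataFrame({
    'Company_Size': [1, 1, 4, 4, 8, 8],
    'Views':        [40, 60, 25, 35, 10, 20],
    'Applies':      [8, 12, 4, 6, 1, 3],
})

# Mean traction (Views) and engagement (Applies) per posting, by size bin
engagement = view_apply.groupby('Company_Size')[['Views', 'Applies']].mean()
print(engagement)
```

Running the same aggregation on the real `view_apply` frame would quantify whether smaller companies indeed attract more views and applies per posting than larger ones.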
Overall, which skills are most in demand? And, later, which skills are most in demand within each industry?
Why is this question important?
The industry-based in-demand skills chart provides a wealth of insights into the labor market's current state and future directions. It highlights key areas for professional development, industry trends, and potential gaps between supply and demand for various skills.
We aim to uncover:
Skill Ecosystem: Skill bubbles by industry reveal growth areas, indicating where job seekers should focus their upskilling efforts.
Career Mapping: Guides professionals on in-demand skills, aligning their career development with market trends.
Strategy and development: Directs educational and workforce strategies towards addressing the demand for niche skills.
What are we doing about it?
We aggregate job-posting data to identify the frequency of skills demanded across all industries in the dataset, after which we find the in-demand skills of the top 20 booming industries. This involves relevant visualisations of the data that let us observe which skills are broadly sought after and which are more niche.
In this question, bubble charts are used to visualize which skills are the most in demand, industry by industry.
Why take a Bubble Chart?
A bubble chart encodes skill frequency in both position and marker size, so the most in-demand skills stand out at a glance even when many skills are plotted together.
# Extracting relevant columns
skills_columns = ['Skill 1', 'Skill 2', 'Skill 3']
company_columns = ['Job_Title', 'Company_Name']
skills_series = pd.concat([job_postings[col] for col in skills_columns]) # Combining skills from "Skill 1," "Skill 2," and "Skill 3" columns
skills_count = skills_series.value_counts() # Counting the occurrences of each skill
skills_count = skills_count.dropna() # Filtering out NaN values
# Creating a DataFrame for the Bubble Chart
bubble_data = pd.DataFrame({
'Skill': skills_count.index, # Assigning skill name to skill column
'Frequency': skills_count.values, # Assigning skill frequencies to frequency column
})
# Creating a Bubble Chart using plotly
fig = px.scatter(bubble_data, x='Skill', y='Frequency', size='Frequency', color='Frequency', # Defining parameters
labels={'Frequency': 'Skill Frequency', 'Skill': 'Skill'}, # Customizing axis labels
title='In-Demand Skills', # Providing title
size_max=80, # Setting bubble size
hover_name=None,
template='plotly_dark', # For Dark Theme
color_continuous_scale=px.colors.sequential.Viridis, # Setting Color Scale
)
fig.update_traces(marker=dict(line=dict(width=2, color='white'))) # Setting bubble border width and color
# Customizing the layout
fig.update_layout(
xaxis=dict(title='Skill', showgrid=False), # Removing x-axis grid
yaxis=dict(title='Skill Frequency', showgrid=False), # Removing y-axis grid
legend=dict(title='Frequency'),
title_x=0.5, # Aligning title
font=dict(size=13.5, color='white'), # Setting font style and size
)
fig.show()
Based on the visualization above, the top required skills across all industries are:
Information Technology:
The continuous digital transformation of business operations is the reason that IT skills are so ubiquitous across industries. As businesses depend more and more on technological solutions, IT specialists are essential to establishing and maintaining the operation of these systems. IT skills are essential for increasing productivity and maintaining competitiveness in the modern business environment, from data analysis to process automation.
Furthermore, professionals may adapt to the evolving landscape of emerging technologies due to the versatility of IT skills. Artificial intelligence, cloud computing, and expertise in cybersecurity are now especially important since they help companies use cutting-edge technology to their advantage and remain resilient against cyberattacks.
Sales:
In every industry, strong sales skills are essential for generating revenue and sustaining company expansion. Salespeople are at the forefront of generating revenue and make a substantial financial contribution to a business. Sales abilities are crucial for companies looking to succeed in competitive marketplaces because they facilitate the development and maintenance of customer relationships as well as efforts to expand market reach.
Moreover, the dynamic landscape of sales requires adapting to digital channels and utilizing data-driven perspectives to develop more focused and efficient marketing approaches. In today's sales environment, professionals who possess a combination of old-fashioned relationship-building abilities and new, analytics-driven methods are especially valuable.
Management:
Every organization needs effective management to run smoothly and be successful. To accomplish organizational objectives, competent managers demonstrate leadership, make strategic choices, and maximize resources. Managing a company's overall health and direction requires a wide range of competencies, from team leadership to strategic planning.
Furthermore, competent managers possess the ability to cultivate a favourable work environment, encourage staff involvement, and overcome obstacles with courage. Effective management also helps organizations succeed by encouraging innovation within teams, implementing efficient workflows, and being able to adapt to change.
Manufacturing:
Production, operations, and supply chain management skills are essential in manufacturing industries. It is essential to be able to guarantee product quality, streamline production procedures, and innovate in response to market demands. Manufacturing experts make a big difference in a company's capacity to maintain its competitiveness and efficiently produce high-quality products.
Engineering:
Innovative thinking, creative problem-solving, and design are all characteristics of engineering skills. When it comes to fostering innovation and advancing the development of new products, engineers are essential. In a variety of industries, engineering skills are crucial for everything from creating unusual solutions to overcoming difficult problems. Furthermore, engineers frequently take on project management duties to guarantee the smooth running of projects from the beginning to completion.
Modern engineering is interdisciplinary, so working with various teams—such as data scientists, business analysts, and marketers—is also necessary. Innovative projects and solutions are successfully implemented when engineers are able to collaborate across disciplines and effectively communicate complex technical concepts.
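The five skills discussed above are simply the highest-frequency entries in the `skills_count` tally built earlier. The extraction pattern, sketched on a hypothetical tally (the counts below are illustrative, not from the dataset):

```python
import pandas as pd

# Hypothetical tally standing in for skills_count (skill -> frequency)
skills_count = pd.Series({
    'Information Technology': 3200, 'Sales': 2900, 'Management': 2600,
    'Manufacturing': 1800, 'Engineering': 1700, 'Finance': 900,
})

# Top five most-demanded skills across all industries
top_skills = skills_count.sort_values(ascending=False).head(5)
print(top_skills.index.tolist())
```

On the real tally, `skills_count.head(5)` already suffices since `value_counts()` returns frequencies in descending order.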
# Grouping DataFrame by 'job_id' and 'industry'
industry_job_counts = df_jobs.groupby(['job_id', 'industry']).size().reset_index(name='Job_Count')
# Calculating size of each group, calculating sum of job count, resetting index, and sorting values in descending order
booming_industries = industry_job_counts.groupby('industry')['Job_Count'].sum().sort_values(ascending=False).head(20)
# Extracting the skills and industry columns
skills_columns = ['Skill 1', 'Skill 2', 'Skill 3']
industry_column = 'Company_Industry'
top_skills_data = pd.DataFrame(columns=['Skill', 'Frequency', 'Industry']) # Creating an empty DataFrame to store the top skills for each industry
skills_data = pd.DataFrame(columns=['Skill', 'Frequency', 'Industry']) # Creating an empty DataFrame to store skills data for each industry
# Iterating over each booming industry, We use loop to examine each booming industry within the dataset.
# For every industry we construct a DataFrame to catalog the skills demanded in job postings.
for industry in booming_industries.index:
# Creating a DataFrame for the specific industry
industry_data = pd.DataFrame({
# The stack() method is used to transform a two-dimensional data structure into a one-dimensional series
# value_counts() generates a count of unique skills and their respective frequencies.
'Skill': job_postings.loc[job_postings[industry_column] == industry, skills_columns].stack().value_counts().index,
'Frequency': job_postings.loc[job_postings[industry_column] == industry, skills_columns].stack().value_counts().values,
'Industry': industry
})
# Checking if the industry_data is not empty
if not industry_data.empty:
# Concatenating the industry data to the DataFrame
skills_data = pd.concat([skills_data, industry_data], ignore_index=True)
skills_data['Frequency'] = pd.to_numeric(skills_data['Frequency']) # Converting 'Frequency' column to numeric type
top_skills_data = skills_data.loc[skills_data.groupby('Industry')['Frequency'].idxmax()] # Finding the top skill for each industry
# Creating a Bubble Chart using plotly
fig = px.scatter(top_skills_data, x='Skill', y='Frequency', size='Frequency', color='Industry', # Defining parameters
labels={'Frequency': 'Skill Frequency', 'Skill': 'Top Skill'}, # Customizing axis labels
title='Top Skills for Top 20 Booming Industries', # Setting title
size_max=90, # Setting bubble size
template='plotly_dark', # Dark theme
color_continuous_scale=px.colors.sequential.Plasma, # Color scale
)
fig.update_traces(marker=dict(line=dict(width=2, color='white'))) # Setting bubble border width and color
fig.update_layout(
xaxis=dict(title='Skill', showgrid=False), # Removing x-axis grid
yaxis=dict(title='Skill Frequency', showgrid=False), # Removing y-axis grid
legend=dict(title='Industry'),
title_x=0.5,
font=dict(family='Arial', size=14, color='black'), # Setting font style, size, and color
paper_bgcolor='rgba(0,0,0,0)', # Transparent figure background
plot_bgcolor='black', # Black plot area to match the dark theme
)
# Showing the plot
fig.show()
industry_job_counts = df_jobs.groupby(['job_id', 'industry']).size().reset_index(name='Job_Count') #Grouping Dataframe by 'job_id' and 'industry'
# Calculating size of each group, resetting index, calculating sum of job count and sorting value in desc order
booming_industries = industry_job_counts.groupby('industry')['Job_Count'].sum().sort_values(ascending=False)
booming_industries.head(4)
industry
Staffing & Recruiting                2184
Information Technology & Services    1754
Hospital & Health Care               1367
Retail                                954
Name: Job_Count, dtype: int64
The analysis above uncovers many interesting insights about the skills required across industries. From it, we can infer which skills are most required within each booming industry. Starting with:
Staffing & Recruiting:
The skill of 'Information Technology' is the most required skill here, as technology plays a major role in many of the Staffing & Recruiting industry's operations, such as applicant tracking, client management, and candidate sourcing. Proficiency in Information Technology is imperative for effectively navigating and utilizing the industry's current technology tools and platforms. Additionally, the need for professionals with IT skills is rising as a result of the increased emphasis on digital transformation across industries; staffing and recruiting companies must keep up with new technologies to remain competitive, which makes IT expertise essential. Furthermore, IT skills let staffing professionals assess candidates' suitability for particular roles and gain a better understanding of client needs and industry trends. They can also communicate more effectively with IT professionals, learn about their career goals, and evaluate them, while IT professionals can offer insightful perspectives on the constantly changing technology landscape. This knowledge helps staffing professionals stay up to date on talent availability, compensation expectations, and industry trends.
Information Technology & Services:
Here too the skill of 'Information Technology' is the most required skill here amongst all other skills, as digital solutions, software, and technology are at the core of the information technology and services sector. Therefore, in order for professionals in this area to comprehend and contribute to the fundamental functions of information technology, they must possess a firm foundation in the field. The demand for it is further driven by the fact that the IT & Services sector provides a wide range of services, such as cybersecurity, cloud computing, software development, and system integration, and the field of "information technology" is broad and versatile, covering a variety of industry domains. Also, for efficient client engagement and communication, a thorough understanding of IT is essential. Innovation and problem-solving are key components of the industry's success, and IT specialists are essential in advancing technology and tackling challenging issues. Their knowledge is crucial for utilizing cutting-edge technology like artificial intelligence and machine learning algorithms to examine enormous databases in search of anomalies and inconsistencies that could point to fraudulent activity.
Hospital & Health Care:
The 'Healthcare and Public Relations' skill set is highly sought after in this area since it is essential to establishing positive relationships, promoting effective communication, and building trust within the healthcare ecosystem. These professionals can also control the flow of information, respond to concerns, and reassure the public, all of which strengthen a healthcare organization's overall resilience. To promote a positive perception of the healthcare facility, healthcare public-relations professionals are essential in developing and executing outreach programs, health education efforts, and community involvement activities. Finally, the role involves collaborating with outside parties, including advocacy groups, government agencies, and the media, to advance healthcare legislation, support public health programs, and increase awareness.
Retail:
Proficiency in sales is essential for generating revenue. Salespeople are essential to the transaction-driven retail industry because they turn leads into customers, which directly affects these companies' bottom lines. Their proficiency in attracting and keeping customers is what keeps retailers profitable and competitive in the market. Sales professionals also have a thorough awareness of the goods and services they represent; this knowledge not only helps customers make well-informed decisions, but also promotes customer satisfaction and trust, both of which are essential for creating enduring customer relationships. Additionally, salespeople play a critical role in understanding clients' needs, developing good and customized relationships with them, and making recommendations specifically designed to meet those needs, greatly increasing client loyalty and satisfaction. Sales personnel now also have to navigate online platforms due to the growth of e-commerce, which contributes to the digital transformation of retail organizations; a well-rounded sales professional can switch between digital and physical sales channels with ease.
Our comprehensive analysis of LinkedIn job postings data has provided valuable insights into remote work trends, in-demand skills, and the alignment between job visibility and application engagement. In the course of this analysis we identified the job types with the highest demand and those receiving the most applications, revealing critical gaps between labour supply and market needs. This project not only delved into the dynamic nature of the job market but also offered actionable data that can empower job seekers, guide companies in talent acquisition, and inform educational institutions' curriculum development to bridge skill gaps.
Overall, this project serves as a strategic tool for all stakeholders in the ecosystem in navigating the complexities of the current landscape of the job market.
"We pledge on my honor that we have not given nor received any unauthorized assistance on this project."
-- Group9-Jobless Jesters